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Abstract 

An open problem that arises when using modern iterative linear solvers, such as the preconditioned 
conjugate gradient (PCG) method or Generalized Minimum RESidual method (GMRES) is how to 
choose the residual tolerance in the linear solver to be consistent with the tolerance on the solution 
error. This problem is especially acute for integrated groundwater models which are implicitly coupled 
to another model, such as surface water models, and resolve both multiple scales of flow and temporal 
interaction terms, giving rise to linear systems with variable scaling. 

This article uses the theory of 'forward error bound estimation' to show how rescaling the linear 
system affects the correspondence between the residual error in the preconditioned linear system 
and the solution error. Using examples of linear systems from models developed using the USGS 
GSFLOW package and the California State Department of Water Resources' Integrated Water Flow 
Model (IWFM), we observe that this error bound guides the choice of a practical measure for con- 
trolling the error in rescaled linear systems. It is found that forward error can be controlled in 
preconditioned GMRES by rescaling the linear system and normalizing the stopping tolerance. We 
implemented a preconditioned GMRES algorithm and benchmarked it against the Successive- Over- 
Relaxation (SOR) method. Improved error control reduces redundant iterations in the GMRES 
algorithm and results in overall simulation speedups as large as 7.7x. This research is expected to 
broadly impact groundwater modelers through the demonstration of a practical approach for setting 
the residual tolerance in line with the solution error tolerance. 

Introduction 

As the groundwater model infrastructure grows to comprehensively and accurately resolve hydro- 
logical processes so too does the need for improved solver technology to address new modeling 
features and take advantage of faster computers. Consequently, there are now a wide range of 
iterative linear solvers available in ground water mode ling packages. Examples include the precondi- 
tioned conjugate grad ient (PCG) method (|Hilll . [l99oT ) . the link-algebraic multi-grid (LMG) Package 



(|Mehl and Hill 120011 ). the algebraic multi-grid (AMG) solver and the generalized conj ugate gradient 
method. The first three o f these are provided with M ODFLOW-2005 (|Harbaughl [200511 and the latter 



is provided in SEAWAT fcuo and LangevinL l2002h . which couples MODFLOW-2005 and MT3DMS 



(jZheng and Wang . Il999h ;o simulate ground water flow with variable density and temperature 



Iterative linear solvers can be broadly categorized into modern, projection based, solvers or classi- 
cal (stationary) solvers. Both can be further categorized into solvers for symmetric or non-symmetric 
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linear systems. The PCG method remains one of the most competitive modern solvers widely used 
for groundwater modeling. Whilst this solver is designed for symmetric matrices only, it shares a 
common property with numerous other modern solvers, namely that the most robust stopping cri- 
teria is based on the residual erroi0. It is, however, the solution error which is the most physically 
relevant error measure - solving a linear system to within 1% accuracy in the groundwater heads is 
more relevant than to within 1% accuracy of the linear system residual. An open problem is how 
to control the solution error given a stopping toler ance on the residual error, w hich behaves very 
differently when the linear system is ill-conditioned ()Axelsson and Kaporin , 2001 ) . 



The issue of choosing a residual based stopping criteria becomes even more apparent when a 
projection based iterative solver is embedded in an iterative linearization method (such as the Pi- 
card or Newton method), which is generally necessary for modeling saturated groundwater flow. A 
relative error based stopping tolerance for the linearization procedure is generally not compatible 
with a residual based stopping criteria in the linear solver. In this situation, the only remedy is a 
judicious choice of stopping tolerance on the residual error, often based on trial and error, which 
ensures the linearization procedures quickly converges without redundant linear solver iterations. 
The problem with this approach is that to achieve a target solution error throughout the simulation, 
the corresponding solver error tolerance may need to change as the linear systems are forced with 
temporal effects such as pumping, stream seepage or wetting and drying of the aquifers. This makes 
the task of choosing the stopping tolerance even more difficult and there may be little choice but to 
over-specify the tolerance, before running the simulation, at the expense of excessive linear iterations 
and performance lagging. 

The multiple scales of flo w in groundwater mode ls coupled with surface water or contaminant 
transport such as GSFLOW dMarkstrom et all l2008h or MT3DMS further highlight the issues asso- 
ciated with linear solver error control. iBlom et al.l (|1993l ) consider the scaling issues arising between 
the solution components of a model for brine transport in groundwater flow. They use a weighted 
norm in the linear solver stopping c riteria in order to ensure that each solution component is solved 



to its corresponding data accuracy. IBlom et al.l (|1993i ) further considered the influence of the linear 



solver error on the convergence of a Newton-type method. They propose a fixed bound on the linear 
solver error, referred to as the forward error which is shown to be inversely proportional to the max- 
imum number of Newton iterations. They don't, however, show how this forward error is related to 
the stopping tolerance and thus it is unclear as to how this approach can be efficiently implemented 
in practice. 

This paper shows how re-scaling poorly scaled linear systems is a key step towards improving con- 
trol of the forward error, since the (normalized) stopping tolerance on the residual norm becomes 
a practical proxy for the upper bound on the forward error. Improved control eliminates unneces- 
sary linear solver iterations without compromising the desired accuracy of the solver. For saturated 
flows, too much error can reduce the convergence rate of the linearization procedure and even prevent 
convergence. 

We demonstrate the practical benefits of rescaling the linear system in two different integrated 
surface water an d groundwater models - the GSFLOW package and the Integrated Water Flow 
Model (IWFM) (|Dogrul and Kadir . 2007 ). IWFM is a water resources management and planning 



tool which simulates groundwater, surface water and stream-groundwater interaction. This model is 
currently being used by the State of California Department of Water Resources in computationally 
demanding long-time high resolution applications such as assessing the impact of climate change 
on water resources and the analysis of different conjunctive use scenarios across California. IWFM 
uses an implicitly formulated Galcrkin finite element method over a non-uniform areal 2D mesh to 
simulate the nonlinear groundwater head dynamics in multi-layer aquifers. 

Whilst both GSFLOW and IWFM implicitly couple the surface water flow with the groundwater 
flow, IWFM combines both sets of flow equations into an integrated linear system with a non- 
symmetric matrix and hence the PCG solver can no longer be used. The need for non-symmetric 
matrix solvers is a growing trend in groundwater modeling. As previously mentioned, SEAWAT uses 



1 The stopping criteria in the Generalized Minimum RESidual method (GMRES) can only be based on an estimate of 
the residual error norm. 
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a generalized conjugate gradient method which is suitable when the matrices are nearly symmetric. 
The need for faster local converge rates than attainable using Picard methods motivates the use 
of Newton-type methods for saturated ground water models. By using full upstream weighting of 
the saturated thicknesses to compute the inter-cell conductances in MODFLOW-2005, Newton-type 
methods give non-symmetric linear systems and require different solvers and settings. iMehll (2006) 
concludes that further exploration into the different solvers and their settings is needed before this 
approach can widely catch on. To partially address this call for further exploration, we provide an 
overview of a promising modern iterative solver for a wider class of non-symmetric matric es, referred 
to as the Generalized Minimum RESidual (GMRES) method ()Saad and Schultz . Il986f). which is 
known to be superior to the classical Successive Over Relaxation (SOR) method (|Van der Vorstl . 
1990HPadilla et al.l . l2008h . 



Overview 

The next Section describes the properties of linear systems that are most performance critical for 
any iterative solver and identifies the prevailing features of numerous case studies to explain why 
integrated surface water and groundwater models are more difficult to solve. Almost all of the linear 
systems that we consider have non-symmetric coefficient matrices and exhibit scaling difficulties. We 
then briefly review the mathematical formulation of GMRES, widely used for scientific computations 
involving non-symmetric linear systems. This solver has only received marginal coverage within 
the groundwater modeling literature and we point out some of its salient features and practical 
implications for groundwater modeling. We also provide the algorithm parameter settings that lead 
to best performance and benchmark it against the classical SOR method. The key contribution of 
this paper is to provide insight into how to choose the residual stopping tolerance in any modern 
iterative linear solver (which uses residual based stopping criteria), not just the GMRES method. 
Whilst this is an open problem, we show that error control in GMRES can be improved by rescaling 
the linear system if it is poorly scaled, as may arise when groundwater models are integrated with 
surface water flow models such as IWFM and GSFLOW. 



Profile of the Linear System 

At each time step in a saturated groundwater model si mulation, a lin earization procedure such as the 
Picard or Newton iterative methods (see for example (Mehl, 2006)) solves the system of saturated 
groundwater flow equations 

F(H k+1 ) = 

in which H k+1 is the vector of unknown multiple-layer aquifer heads over a 2D bounded domain at 
iteration k + 1 . This definition is general enough to include implicitly coupled integrated groundwater 
and surface models (such as IWFM), which also include stream and lake surface elevations in the 
vector of unknowns. For ease of exposition we present the linear system in canonical form by denoting 
the difference vector x = H k+1 —H k without an iteration index, the Jacobi (or approximate Jacobian) 
matrix A — VF{H k ) with elements = J^r and the right-hand side vector b = —F{H k ) to give 

3 

Ax = b, A e R NxN , x, b e R N , (I) 

where A is a positive definitcl square matrix. This paper will just consider the efficient iterative 
approximation of linear system [T] for the case when A is non-symmetric. However, the issue of 
accuracy control of iterative solvers discussed later in this paper also applies to any positive definite 
matrix (symmetric or not). 

Table [T] shows the performance critical properties of the coefficient matrix A for seven datasets 
arising in various groundwater packages. The first four data sets are from applications using IWFM. 
HCMP is a synthetic hydrological comparison dataset, C2VSIM and C2VSIM9 are from respective 



2 A matrix A is positive definite if x T Ax > for all real x ^ 0. 



3 





IWFM 


GSFLOW / MODFLOW 




HCMP 


C2VSIM 


C2VSIM9 


BUTTE 


INCLINE 


NAC 


SAGEHEN 


Dimension 


46460 


4630 


12988 


34683 


42820 


75319 


6784 


NNZ 


479246 


41616 


125616 


188006 


261696 


433791 


31504 


Sparsity(%) 


0.0220 


0.194 


0.0744 


0.0156 


0.0143 


0.00765 


0.0685 


Normality 


0.271 


0.222 


0.908 


0.199 


0.00730 


0.0114 





k(A) 


3.09E6 


2.54E11 


5.13E6 


1.95E9 


1.50E8 


7.25E4 


6.71E8 



Table 1: Linear solver performance critical properties for seven different datasets taken from applications 
using IWFM, GSFLOW and MODFLOW. 



three and nine aquifer layer integrated models of California's Central Valley and BUTTE is from a 
high resolution integrated model of BUTTE county. INCLIN E and SAGEHEN are fro m GSFLOW 
models of the Incline and Sagehen water creeks respectively (see Markstrom et al.l (|2008l ) for details of 



the S agehen water creek example). NAC is from a two-layer model of Nacatoch Aquifer (jBeach et al 



20091 ) which uses MODFLOW-2000. The INCLINE and the NAC datasets were produced using a 



Newton Method in a development version of the MODFLOW package. 

Dimension N is the size of the matrix and NNZ denotes the number of non-zero elements. Each 
matrix is sparse and lacks any block structure. Sparsity is the percentage of the elements in a matrix 
which are non-zero. Normality is the relative measure \\AA* — A*A||/||A|| 2 which is zero when A 
is symmetric. k(A) is the estimated condition numbei|f| of A and is a measure of sensitivity of the 
linear system and the convergence rate of iterative solvers. 

Figure 1 illustrates scaling issues arising in the coefficient matrices from integrated ground wa- 
ter and surface water models. Figures la and lb show the sparsity patterns of the C2VSIM and 
SAGEHEN coefficient matrices respectively. Each Figure has been separated into distinctive zones 
for illustrative purposes and a color scheme arbitrarily differentiates scale. In Figure la, the sparsity 
structure in the uniform 3x3 grid corresponds to the three aquifer layers of the C2VSIM model and 
their interactions with each other. For example, the middle layer interacts with the layer above and 
below and thus exhibits a diagonal band for each together with a central band for the convection and 
diffusion within the layer. The bottom and top layers only interact with one other aquifer layer and 
thus only exhibit two bands. The non-uniform micro-structure within the central band typifies that 
of an unstructured Galerkin finite element groundwater water flow model. The upper left-hand zone 
of Figure la corresponds to the stream nodes and the zones to the right and below correspond to the 
stream-groundwater interaction terms with the top level aquifer. The 2x2 uniform grid in Figure 
lb corresponds to two aquifer layers of the SAGEHEN water creek model and their interactions with 
each other. Both Figures illustrate the scaling issues. In Figure la, matrix elements whose absolute 
values are above and below an arbitrary threshold of O(10 6 ) are shown in red and blue, respectively, 
whilst in Figure lb, the threshold is 0(10). 

Figures lc and Id respectively show the corresponding graphs of the C2VSIM and SAGEHEN 
coefficient matrix element sizes to further illustrate the scaling issues. Figure lc is split into the rows 
corresponding to the surface water and top aquifer layer by the vertical red line. The vertical axis is a 
power scale for the absolute matrix element sizes in each of the first one thousand rows whose indices 
are shown on the horizontal scale. The non-zero matrix elements corresponding to the stream nodes 
in the C2VSIM model are not only much sparser than those corresponding to the aquifer nodes, but 
exhibit a broader range of absolute values. Figure Id is split into the rows corresponding to the top 
and bottom aquifer layers of the SAGEHEN model by the vertical red line. 



Objectives Having illustrated some of the scaling issues arising in the coefficient matrices from 
IWFM and GSFLOW, our objectives are to 



k(A) is estimated using the SuperLU (|Demmel et all 1 19991 ) routines dgscon and dlangs. 
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row index row index 



(c) (d) 

Figure 1: The sparsity pattern of the (a) C2VSIM coefficient matrix and (b) SAGEHEN coefficient 
matrix, (c) A log scale of the absolute value of the elements in the (c) C2VSIM coefficient matrix 
against the row index (which has been truncated to highlight the interface region) and the (d) SAGEHEN 
coefficient matrix against the row index. 

• Solve the linear system (J]) iteratively to a desired level of accuracy in the solution; 

• Assess the impact of the scaling issues on solver robustness and performance; and 

• Provide practical techniques and optimal parameterizations that readers can either implement 
or use to improve solver robustness and performance. 

The following Section describes the use of GMRES with preconditioning to efficiently solve (p}. 
This algorithm terminates when an estimate of the preconditioned linear system residual norm is 
below a tolerance r. The problem is how to choose r so that the solution is to within the desired 
accuracy. If the estimated condition number of the preconditioned coefficient matrix is known, then 
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this task becomes trivial. However, the coefficient matrices are poorly conditioned and may vary 
considerably throughout the simulation. The matrices are also too large for efficient estimation of 
the condition number prior to solving each preconditioned linear system. 

The solution is to rescale the linear system so that the bound of the relative forward error of 
the linear system is about the same order of magnitude as the stopping tolerance normalized by the 
right-hand-side vector. This avoids the computationally prohibitive step of having to estimate the 
condition number of the preconditioned coefficient matrices before each solve and instead provides a 
simple approach with negligible additional computation. 



The GMRES Algorithm and Preconditioning 

The Generalized Minimum RESidual (GMRES) method is a Krylov subspace projection method for 
solving the linear system (|TJ) based on taking the pair of projection subspaccs 

W = IC m (A,r ) and V ^ AW, (2) 

where K, m (A,ro) is a Krylov subspace defined as 

K m (A,r ) :=spa,n{r ,Ar ,A 2 r ,...,A m - 1 r } (3) 

and r = b — Ax with an initial approximate solution x . An approximation solution x € xq + W 
has the form x = q m ~\{A)r Q and Ax — b _L V, where q m (A) is a matrix polynomial of degree to. 

GMRES first uses an Arnoldi procedure to build an orthonormal matrix V m = [v±, V2, ■ ■ ■ , v m ] 
whose column vectors span the subspace W = KL m {A, ro). In matrix notation, the Arnoldi procedure 
can be expressed by the following governing equation 

AV m = V m+1 H m , (4) 

where H m := [H^, h m+l m e m ] T , and H m is to x to upper Hessenberg matrix, and e m is the last 
column to- vector in the identity matrix I m . An iterative solution to the linear system ([T]) can be 
written in the form x m = x + V m y m , where the to- vector y m is the solution to the least squares 
problem 

y m = argmin||r m || = argmin||/?ei - H m y\\, (5) 
y y 

which minimizes the residual. Thus GMRES finds the best x m which minimizes the res idual r m 
by reducing A to H m using the orthonormal bases V m and V m +i- We refer the reader to Demmell 



( 1997t ): ISaadl f20001 'or a more detailed explanation of the GMRES method. GMRES(to) is a memory 



efficient and more stable variant of GMRES, which resets the algorithm after to iterations by setting 
xq = %m so that the memory requirements are 0(mN). m is typically set to between 10 and 20. 

Preconditioning is known as the 'determining ingredient' in the success of the GMRES and other 
iterative methods for solving large scale problems. The convergence rate and computational cost of 
solving the preconditioned linear system 

M~ x Ax = M~ x b 

depend on the choice of the preconditioncr M . The choice of M is typically inferred from experience 
which tells us that the form of M should (i) ensure that n(M~ 1 A) <C n(A) and (ii) be computationally 
inexpensive to solve My = Ax for y given a vector Ax. 

For GMRES, an ideal choice is typically one in which M~ l A is close to normal and whose eigenval- 
ues are tightly clustered around som e poin t away from the origin. The incomplete LU decomposition 
(ILU) is a popular preconditioner ( Saadl . l2000n . For example, considering an LU decomposition 



A = LU , where L is a unit lower triangular matrix and U is an upper triangular matrix. Replac- 
ing non-zero elements of L and U outside the sparsity pattern of A with zero elements gives the 
incomplete factors L and U. An ILU preconditioner is then formed by setting M = LU. 
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A high-level description of the preconditioned GMRES(m) method is provided below. 



Preconditioned GMRES(m) 




Input: A, M 7 b, xo,tti,t 




OutDut: i,„,7„±i , m 




1. compute rn = M _1 (6 — ^4a;n), — 1 1 7"o 1 1 2 and 


vi := ro/0 


2. for j = 1, 2, . . . , m do 




3. solve Mhi = Avj for 11; 




4. for i = 1,2, . . . , j do 




5. /ly = v^u> 




6. w := w ~ hijWi 




7. end do 




8. compute = HHh an d Uj+i = w/ftj 




9. end for 




10. find y m so that 7 m+ i, m := ||/3ei - H m y m \\ = 


argmin y \\0e x - H m y\\ 


11. a; m = x + F m t/ ro 




12. If 7 m +i im < r, stop, else goto Line 1 with xq 





Note that on each iteration of the PGMRES(to) algorithm, the linear system Mw = Avk, where 
Wfe G Vfe, is solved for the vector w. When M is an ILU factorization, M — LU, w is determined by 
forward and back substitutions: 

Lz = Avk and Uw = z, (6) 

where L and U are respectively a unit lower and upper triangular matrix with 2 • Lf il + 1 entries 
per row. The level of fill-in, Lf il, is typically chosen to be between 5 and 10. The PGMRES(m) 
algorithm terminates when the estimated residual norm 7 m +i,m := \\0ei — H m y m \\ satisfies the 
stopping criteria 7 m +i, m < r. 



Scaling and Error Control 

Each stopping tolerance r for the approximate solution x computed by the PGMRES iteration can 
be associated with a corresponding estimate of the upper bound on the relative forward error norm 
||:r — a;||/||x|| of the linear system ([IJ. Whilst the acceptable upper bound on the relative forward error 
norm is determined from the data accuracy, the corresponding tolerance can not easily be implied 
from the data accuracy. Misspecification of the tolerance can result in either over-solution of the 
linear system ([1} or unacceptably high forward error with respect to the data accuracy, especially 
when the coefficient matrix is poorly conditioned and scaled. 

To reduce the difference between the estimated upper bound on the relative forward error norm 
and the residual norm, we introduce a diagonal scaling matrix D so that the preconditioned linear 
system becomes 

M^D^Ax = M~ l D- l b. (7) 
The associated residual r for a n approximate solution x is defi ned as r = M~ 1 D~ 1 (b — Ax). By a 



standard perturbation analysis (jDemmell . Il997t iHighaml . |2002| ). the upper bound, denoted as Ferr, 



on the relative forward error norm can be given in terms of the norm of the residual r: 

^jPjp < 'ir'AJpjJfLj = Ferr, (8, 

where k(M~ 1 D~ 1 A~ 1 ) is the condition number to characterize the difference between the relative 
forward error norm and the ratio of the residual norm to the right-hand side vector norm ||M _1 _D _1 6||. 
The condition number can not in general be efficiently obtained during simulation due to the cost 
and dynamic nature of the linear systems and thus prohibits the evaluation of Ferr explicitly as the 
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With 


row scalin 




Without row scaling 


Inn t 


II 7? — T II / II 77 1 


Fer/r 




Ht — tII / 1 1 t 1 


FG2r2r 




_1 


1.99E-4 


6.24E-2 


1.12E-2 




7 ^7F& 


Q A7F 1 1 

(J .t: 1 1-J L X 


_2 


1.46E-5 


1.81E-3 


1.12E-3 


1.76E-5 


2.59E8 


3.47E-12 


-3 


8.61E-7 


2.30E-4 


1.12E-4 


9.49E-7 


3.29E7 


3.47E-13 


-4 


5.09E-8 


1.11E-5 


1.12E-5 


1.27E-7 


1.11E6 


3.47E-14 


-5 


5.09E-8 


1.11E-5 


1.12E-6 


1.05E-8 


7.13E4 


3.47E-15 


-6 


1.48E-9 


4.24E-7 


1.12E -7 


1.05E-8 


7.13E4 


3.47E-16 


-7 


1.09E-10 


1.82E-8 


1.12E -8 


1.06E-9 


1.79E3 


3.47E-17 


-8 


4.77E-12 


9.51E-10 


1.12E -9 


7.11E-11 


2.03E2 


3.47E-18 



Table 2: For a given stopping tolerance r, this Table compares the exact forward error norm, the 
estimated upper bound on the relative forward error norm and the normalized stopping tolerance e 
from separately solving each of the linear systems M _1 D~ 1 Ax = M~ 1 D~ 1 b (with row scaling) and 
M~ l Ax = M' l b (without row scaling) using PGMRES applied to the C2VSIM dataset. 



stopping criteria for the PGMRES(m) iteration. By choosing D as the sum of row elementsQ 

D = diag(|Ae|i, \Ae\ 2 , . . . , \Ae\ N ), (9) 



we both normalize A and minimize the condition number of D 1 A ( Hie haml l2002h thereby signif- 



icantly sharpening the difference between the estimated upper bound on the relative forward error 
norm and the residual norm. Rescaling is a key step towards control of the forward error, since the 
normalized stopping tolerance e, where t = e||6||*, on the estimated residual norm becomes a practical 
proxy for the upper bound on the relative forward error norm 0. 

To illustrate this property, let us examine the C2VSIM and INCLINE datasets and modify the 
right hand side vectors very slightly so that an exact solution is known for the purpose of the 
proceeding analysis. Let x denote the approximate solution of the original problem D~ 1 Ax = D~ 1 b 
to machine precision. If we now generate a slightly different right hand side vector D~ x b = D~ 1 Ax 1 
then x becomes an exact solution for the 'nearby' rescaled linear system D~ 1 Ax = D~ 1 b. For ease 
of notation, we will simply denote this linear system as D~ l Ax = D~ l b and use x to instead denote 
its approximate solution. 

For a given stopping tolerance r, Tables and [3] show the exact relative forward error norm 
1 1 a; — a;||/||x||, the estimated upper bound on the relative forward error norm Ferr and the normalized 
stopping tolerance e, both with and without row scaling, for the C2VSIM and INCLINE examples 
0. With scaling, the upper bound on the relative forward error norm estimate is the same order of 
magnitude as e, although it is about an 0(IO 2 ) higher than the exact relative forward error norm. 
Without scaling, the estimate of the upper bound on the relative forward error norm is considerably 
larger than the normalized stopping tolerance e and at least an O(10 5 ) larger than the exact relative 
forward error norm. 



Discussion The main implications and limitations of the proceeding analysis are 

• The coefficient matrices arising from integrated groundwater models should always be rescaled, 
irrespective of whether the stream nodes are represented in the coefficient matrix. This obser- 
vation rests on the theory of forward error estimation, which is solver independent; 

4 This choi ce of scaling i s refe rred to as row equilibration, e denotes the unit vector of length N. 
5 Following |Blom et al.l jl993;), we build the sca ling into the norm 1 1 ■ ||, := ||-D _1 ■ ||. 

6 Ferr is computed from using the SuperLU (|Demmel et alj . fl999f ) routines dgscon to estimate the condition number 
k{M~ x D~ x A ). 
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With row scaling 


Without row scaling 


Ion t 








Ht — tII / 1 1 t 1 


Fsrr 




-1 


9 ^fiF 7 


■4:. U^lJ O 


O.OOJIL/-tJ 


K zKlF, 7 
•J .^±\> i j t 


1 1 1 F 9 


9 97F & 


_2 


5.17E-8 


3.55E-6 


3.83E-6 


3.87E-8 


6.32E-3 


2.27E-9 


-3 


4.01E-9 


6.61E-7 


3.83E-7 


7.86E-9 


3.86E-4 


2.27E-10 


-4 


4.61E-10 


3.54E-8 


3.83E-8 


3.33E-10 


1.07E-5 


2.27E-11 


-5 


3.49E-11 


3.68E-9 


3.83E-9 


5.35E-11 


1.05E-6 


2.27E-12 


-6 


6.97E-12 


6.50E-10 


3.83E-10 


5.05E-12 


7.37E-7 


2.27E-13 


-7 


2.27E-12 


8.18E-11 


3.83E-11 


1.56E-12 


1.37E-8 


2.27E-14 


-8 


2.21E-12 


3.80E-11 


3.83E-12 


1.50E-12 


9.60E-9 


2.27E-15 



Table 3: For a given stopping tolerance r, this Table compares the exact forward error norm, estimated 
upper bound on the relative forward error norm and the normalized stopping tolerance e from separately 
solving each of the linear systems M~ 1 D~ 1 Ax = M~ 1 D~ 1 b (with row scaling) and M _1 Ax = M~ 1 b 
(without row scaling) using PGMRES applied to the INCLINE dataset. 



• The stopping tolerance in the GMRES algorithm should be chosen using the relation t = e \b\ |*, 
where e is the desired target accuracy on the relative forward error. Even with e fixed throughout 
a simulation, r becomes a dynamical tolerance as ||6||* varies in size; and 

• The empirically observed close correspondence between e and Ferr follows from the effectiveness 
of the preconditioner to reduce k(M~ 1 D~ 1 A) close to unity. The methodology described above 
is sufficiently broad that readers can follow our approach and establish correspondence for 
alternative choices of preconditioners and solvers. It also leads to the idea that a proxy can be 
found by comparing it with the exact error of the nearby rescaled linear system, without the 
need to estimate the forward error bound (to observe the effects of rescaling). 



Implementation and Performance Benchmarking 

Our Fortran 90 implement ation of PGMRES(m) is adapted from the publically available sparse 
matrix package SPARSKIT (ISaadl . l2000h . This package provides general purpose Fortran 77 routines 



for preconditioning and iteratively solving sparse linear systems using matrices stored in compressed 
sparse row (CSR) format. We developed routines to efficiently convert IWFM storage format arrays 
into CSR format arrays and simultaneously rescale the linear system. 

The PGMRES implementation uses a reverse communication interface- an implementation style 
which enables the computationally intensive modules to be externally referenced. The sparse matrix- 
vector multiplication module and the LU solver are called at each iteration of PGMRES to form the 
Krylov subspace. In line 3 of the PGMRES algorithm, the vector y := Mx — AVj is formed by 
sparse matrix- vector multiplication of A and Vj. The LU solver then solves for a temporary vector 
z using y ~ Lz followed by solving for z using z — Ux, where L and U are computed by the ILUT 
preconditioner prior to execution of the PGMRES algorithm. 

There are two distinct advantages to implementing an iterative solver with a reverse communi- 
cation interface. Firstly, the PGMRES solver is independent of the choice of preconditioner and 
sparse matrix storage format because it does not directly process the matrices A and M. Secondly, 
the bottleneck modules in the solver can be executed in high performance matrix algebra libraries 
which are tuned for the computer architecture. The matrix- vector multiply is classified as a sparse 
Basic Linear Algebra Subroutine (BLAS) level 2 operation and is implemented for a range of parallel 
computing and accelerator platforms including the newly emerging many-core CPU architectures. 

Our numerical experiments are performed using a Linux based Intel Fortran compiler VI 1.0 on a 
2.00GHz Intel(R) Core(TM) 2 Duo CPU (T6400) with 2MB cache. The relaxation parameter for the 
SOR method is set to u = 1.1, the restart threshold of PGMRES is m = 20 and the ILUT (ILU with 
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Table 4: A comparison of the number of iterations of SOR and PGMRES(m) applied to four of the 
datasets as the stopping tolerance r is decremented. 



threshold preconditioner has a drop toleranc^] of 0.01 and maximum fill-irO of p = 10. 



We find by numerical experiment that this choice of PGMRES parameters gives optimal convergence 
rates for all of the non-symmetric datasets described in Table [U In contrast, the optimal choice of 
to varies significantly between each dataset. For the IWFM datasets, 1.1 < w < 1.3 is found to be 
an optimal range, whereas uj = 1.95 and w = 1.5 are found to give optimal convergence rates for the 
NAC and INCLINE datasets. The sensitivity of the SOR method to ui is one reason to avoid using 
it for integrated groundwater flow models because of its sensitivity to changes in the linear systems. 

The number of iterations and elapsed wall clock times for SOR and PGMRES (to) applied to four 
of the datasets are respectively shown in Tables H] and [5] for a range of stopping tolerances r. The 
overall speedup from using PGMRES in place of the SOR solver is highly dependent on the dataset. 
For example, the speedup when using a tolerance of 1 x 10 -8 is approximately 1.5x for the NAC 
dataset, 30x for the C2VSIM and 240x for the BUTTE and INCLINE datasets. The corresponding 
approximation reduction in the number of iterations is respectively 6x, 90x, 800x and 900x. This 
reduction is attributed to the comparative effectiveness of the ILUT preconditioner at reducing 
the condition number of the preconditioned coefficient matrix and is a key feature necessary for a 
dynamical stopping tolerance r = e||Z? _1 6||. In a separate experiment (not shown here), the GMRES 
method only needed about twice as many iterations to converge to machine precision. In contrast, 
the SOR method struggles to reach higher precision. The discrepancy between the realized speedups 
versus the reduction in the number of GMRES iterations is attributed to the cost of constructing 
the preconditioner. Once the preconditioner has been constructed, the relative additional cost of 
converging to higher precision is observed to correspond to the increase in the number of iterations. 

Finally, Table 6 shows the overall performance improvement in the IWFM simulation using the 
PGMRES(m) solver in place of the SOR solver and the proportion (shown in parentheses) of overall 
computation spent in the preconditioner and solvers for three of the datasets. The C2VSIM and 
C2VSIM9 simulations are run over 82 years at monthly increments (984 time steps) and the HCMP 
simulation is run over 2 years at weekly increments (104 time steps). The normalized tolerance e in 
the linear solver was fixed throughout the simulation at 1 x 10 -5 which is 0.1 times the relative data 
accuracy of 1 x 10~ 4 . 

As expected, the overall performance gains from using PGMRES(m) are more prominent with the 
larger datasets since the absolute time reduction is most significant- HCMP and C2VSIM9 exhibit 
7.74x and 7.56x speedups respectively. The reduction of the total time spent in the solver is also 
considerably more significant for the C2VSIM9 dataset than the HCMP and C2VSIM datasets - 
64. 3x reduction versus 14. 4x and 19. Ox respectively. This relative improvement is different from the 



7 An element is replaced by zero if it is less than the drop tolerance multiplied by the original norm of the row containing 
the element. 

8 Only the p largest elements in each upper and lower factor matrix are retained, the remainder are replaced by zero. 
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Table 5: A comparison of elapsed wall clock time (in seconds) of SOR and PGMRES(m) applied to four 
of the datasets as the stopping tolerance r is decremented. All timings are reported to three significant 
figures. 





HCMP 


C2VSIM 


C2VSIM9 


IWFM (SOR) 
IWFM (PGMRES) 


20.4 (84.0%) 
2.63 (45.1%) 


15.59 (79.0%) 
7.12 (9.12%) 


121 (82.0%) 
16.0 (9.65%) 


Speedup 


7.74x (14.4x) 


2.2x (19. Ox) 


7.56x (64.3x) 



Table 6: This Table shows the time in minutes and proportion of IWFM simulation time (in parenthesis) 
spent in the solvers for each of the datasets. The bottom row shows the speedup in IWFM simulation 
time and total solver time (in parenthesis) if the SOR solver is replaced by PGMRES. The C2VSIM and 
C2VSIM9 simulations use 984 time steps and the HCMP simulation uses 104 time steps. All timings 
are reported to three significant figures. 



isolated benchmarking results (some of which are shown in Tables |4] and El) in which the speedup 
from using PGMRES is 150x, lOx and 30x when using the C2VSIM9, HCMP and C2VSIM datasets 
respectively. This indicates that the linear system used for the linear solver benchmarking in Tables 
U] and [5] leads to optimistic performance gains compared to the total linear solver time reduction 
over the simulation, in which the linear systems may change significantly. With PGMRES, iterative 
solution of the linear system ceases to be a major bottle neck for the C2VSIM and C2VSIM9 datasets 
- only 9.12% and 9.65% of the overall simulation time is spent in the linear solver. This is the reason 
why the IWFM speedups of 2.2x and 7.56x are significantly lower than the total solver time reductions 
of 19. Ox and 64. 3x. Conversely, the linear solver is still a major bottleneck in IWFM with the HCMP 
dataset accounting for 45.1% of the IWFM simulation time. 

Conclusion 

An open problem that arises in many modern iterative linear solvers is how to choose the residual 
tolerance in the linear solver to be consistent with the tolerance on the solution error. This article 
firstly illustrates the scaling issues arising in linear systems from integrated groundwater and surface 
water models. It is shown that re-scaling is a key step towards control of the forward error, irrespec- 
tive of the choice of solver and preconditioner. We implemented a preconditioned GMRES algorithm 
and observe that the normalized stopping tolerance on the estimated residual norm becomes a proxy 
for the estimated upper bound on the forward error when the linear systems are rescaled. Fur- 
thermore, the rescaling and modified error control are simple to implement at negligible additional 
computational cost. 



11 



This article demonstrates a number of favorable properties of PGMRES: (i) the optimal ILUT 
preconditioner parameter settings are independent of the dataset; (ii) it is well suited to adaptive 
residual error control; (hi) when benchmarked against the SOR method, the comparative speedups 
with rescaling and error control can lead to significant overall application speedups (as high as 7.7x for 
IWFM); and (iv) performance profiling shows that the new linear solver removes a major performance 
bottleneck in IWFM. 

This research is expected to broadly impact groundwater modelers by demonstrating a practical 
and general approach for setting the residual tolerance in line with the solution error tolerance. 
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